METHOD FOR CONVERGING A G.729 ANNEX B COMPLIANT 
VOICE ACTIVITY DETECTION CIRCUIT 
Dunling Li, Dan Thomas, Gokhan Sisli 

FIELD OF THE INVENTION 

The invention relates to improving the estimation of background noise energy in a 
communication channel by a G.729 voice activity detection (VAD) device. Specifically, 
the invention establishes a better initial estimate of the average background noise energy 
and converges all subsequent estimates of the average background noise energy toward its 
actual value. By so doing, the invention improves the ability of the G.729 VAD to 
distinguish voice energy from background noise energy and thereby reduces the 
bandwidth needed to support the communication channel. 

BACKGROUND OF THE INVENTION 

The International Telecommunication Union (ITU) Recommendation G.729 
Annex B describes a compression scheme for communicating information about the 
background noise received in an incoming signal when no voice activity is detected in the 
signal. This compression scheme is optimized for terminals conforming to 
Recommendation V.70. The teachings of ITU-T G.729 and Annex B of this document 
are hereby incorporated into this application by reference. 

Traditional speech encoders/decoders (codecs) use synthesized comfort noise to 
simulate the background noise of a communication link during periods when voice 
activity is not detected in the incoming signal. By synthesizing the background noise, 
little or no information about the actual background noise need be conveyed through the 
communication channel of the link. However, if the background noise is not statistically 
stationary (i.e., the distribution function varies with time), the simulated comfort noise 
does not provide the naturalness of the original background noise. Therefore it is 
desirable to occasionally send some information about the background noise to improve 
the quality of the synthesized noise when no speech is detected in the incoming signal. 
An adequate representation of the background noise, in a digitized frame (i.e., a 10 ms 
portion) of the incoming signal, can be achieved with as few as fifteen digital bits, 
substantially fewer than the number needed to adequately represent a voice signal. 
Recommendation G.729 Annex B suggests communicating a representation of the 
background noise frame only when an appreciable change has been detected with respect 
to the previously transmitted characterization of the background noise frame, rather than 
automatically transmitting this information whenever voice activity is not detected in the 
incoming signal. Because little or no information is communicated over the channel 
when there is no voice activity in the incoming signal, a substantial amount of channel 
bandwidth is conserved by the compression scheme. 

Figure 1 illustrates a half-duplex communication link conforming to 
Recommendation G.729 Annex B. At the transmitting side of the link, a VAD module 1 
generates a digital output to indicate the detection of noise or voice energy in the 
incoming signal. An output value of one indicates the detected presence of voice activity 
and a value of zero indicates its absence. If the VAD 1 detects voice activity, a G.729 
speech encoder 3 is invoked to encode the digital representation of the detected voice 
signal. However, if the VAD 1 does not detect voice activity, a Discontinuous 
Transmission/Comfort Noise Generator (noise) encoder 2 is used to code the digital 
representation of the detected background noise signal. The digital representations of 
these voice and background noise signals 7 are formatted into data frames containing the 
information from samples of the incoming analog signal taken during consecutive 10 ms 
periods. 

At the decoder side, the received bit stream for each frame is examined. If the 
VAD field for the frame contains a value of one, a voice decoder 6 is invoked to 
reconstruct the analog signal for the frame using the information contained in the digital 
representation. If the VAD field for the frame contains a value of zero, a noise decoder 5 
is invoked to synthesize the background noise using the information provided by the 
associated encoder. 

To determine whether a frame contains voice or noise activity, the VAD 1 extracts and 
analyzes four parametric characteristics of the information within the frame. These 
characteristics are the full- and low-band noise energies, the set of Line Spectral 
Frequencies (LSF), and the zero crossing rate. A difference measure between the 
extracted characteristics of the current frame and the running averages of the background 
noise characteristics is calculated for each frame. Where small differences are detected, 
the characteristics of the current frame are highly correlated to those of the running 
averages for the background noise and the current frame is more likely to contain 
background noise than voice activity. Where large differences are detected, the current 
frame is more likely to contain a signal of a different type, such as a voice signal. 

An initial VAD decision regarding the content of the incoming frame is made 
using multi-boundary decision regions in the space of the four differential measures, as 
described in ITU G.729 Annex B. Thereafter, a final VAD decision is made based on the 
relationship between the detected energy of the current frame and that of neighboring past 
frames. This final decision step tends to reduce the number of state transitions. 

The running averages of the background noise characteristics are updated only in 
the presence of background noise and not in the presence of speech. Therefore, an update 
occurs only when the VAD 1 has identified an incoming frame containing noise activity 
alone. The characteristics of the incoming frame are compared to an adaptive threshold 
and an update takes place only if the following three conditions are met: 

1) $E_f < E_{f,avg} + 3$ dB; 

2) $RC(1) < 0.75$; and 

3) $\Delta SD < 0.0637$; 

where, 

$E_f$ = the full-band noise energy of the current frame, calculated using the 
equation: 

$$E_f = 10 \log_{10}\left[\frac{1}{240} R(0)\right],$$ 

where $R(0)$ is the first autocorrelation coefficient; 

$E_{f,avg}$ = the average full-band noise energy; 

$RC(1)$ = the first reflection coefficient; and 

$\Delta SD$ = the difference between the measured spectral distance for the current frame 
and the running average value of the spectral distance, with a $\Delta SD$ of 0.0637 
corresponding to 254.6 Hz. 

The average full-band noise energy, $E_{f,avg}$, is further updated, as is a counter, $C_n$, of 
noise frames, according to the following conditions. Set: 

$E_{f,avg} = E_{min}$; and 

$C_n = 0$, 

when, 

$C_n > 128$; and 

$E_{f,avg} < E_{min}$. 
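
The three update conditions and the minimum-energy reset described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation of Recommendation G.729 Annex B: the names `NoiseState`, `should_update`, and `apply_min_reset` are hypothetical, and all energies are assumed to be expressed in dB.

```python
from dataclasses import dataclass


@dataclass
class NoiseState:
    """Hypothetical container for the running noise estimates (energies in dB)."""
    e_f_avg: float  # running average of the full-band noise energy
    e_min: float    # long-term minimum energy
    c_n: int = 0    # counter of noise frames


def should_update(e_f: float, rc1: float, delta_sd: float, state: NoiseState) -> bool:
    """Return True only when all three update conditions hold."""
    return (
        e_f < state.e_f_avg + 3.0   # 1) E_f < E_f,avg + 3 dB
        and rc1 < 0.75              # 2) RC(1) < 0.75
        and delta_sd < 0.0637       # 3) delta SD < 0.0637
    )


def apply_min_reset(state: NoiseState) -> None:
    """Set E_f,avg to E_min and clear C_n when C_n > 128 and E_f,avg < E_min."""
    if state.c_n > 128 and state.e_f_avg < state.e_min:
        state.e_f_avg = state.e_min
        state.c_n = 0
```

Note that `should_update` depends on `state.e_f_avg`, the very value being updated. This dependence is what allows the running average to lock out further updates once it has drifted.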

When a frame of noise is detected, the running averages of the background noise 
characteristics are updated to reflect the contribution of the current frame using a first 
order Auto-Regressive (AR) scheme. Different AR coefficients are used for different 
parameters, and different sets of coefficients are used at the beginning of the 
communication or when a large change of the noise characteristics is detected. The 
running averages of the background noise characteristics are initialized by averaging the 
characteristics for the first thirty-two frames (i.e., the first 320 ms) of an established link. 
Frames having a full-band noise energy $E_f$ of less than -70 dBm are not included in the 
count of thirty-two frames and are not used to generate the initial running averages. 
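
A first-order AR update of a running average may be sketched as below. The blending form `beta * average + (1 - beta) * current` and the coefficient values are assumptions made for illustration; the text above states only that a first-order AR scheme with parameter-specific coefficients is used. The names `ar_update` and `update_all` are hypothetical.

```python
def ar_update(running_avg: float, current: float, beta: float) -> float:
    """Blend the current frame's value into a running average (first-order AR)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * running_avg + (1.0 - beta) * current


def update_all(averages: dict, frame: dict, betas: dict) -> dict:
    """Apply a separate AR coefficient to each background noise parameter."""
    return {
        name: ar_update(averages[name], frame[name], betas[name])
        for name in averages
    }
```

A coefficient close to one makes the average respond slowly to the current frame, while a smaller coefficient lets it track changes faster. A faster-tracking set would suit the start of the communication or a large change in the noise.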

Based on the conditions established by G.729 Annex B, described above, for 
updating the running averages of the background noise characteristics, there are common 
circumstances that cause the running averages to substantially diverge from the 
background noise characteristics of the current and future frames. These circumstances 
occur because the conditions for determining when to update the running averages are 
dependent upon the values of the running averages. Substantial variations of the 
background noise characteristics, occurring in a brief period of time, decrease the 
correlation between the current background noise characteristics and the expected 
background noise characteristics, as represented by the running averages of these 
characteristics. As the correlation decreases, the VAD 1 has increasing difficulty 
distinguishing frames of background noise from those containing voice activity. When 
the divergence reaches a critical point, the VAD 1 can no longer accurately distinguish 
the background noise from voice activity and, therefore, will no longer update the running 
averages of the background noise characteristics. Additionally, the VAD 1 will interpret 
all subsequent incoming signals as voice signals, thereby eliminating the bandwidth 
savings obtained by discriminating the voice and noise activity. 

Without some modification to the algorithm described in Recommendation G.729 
Annex B, once the running averages of the background noise characteristics and the 
actual characteristics become critically diverged, the VAD 1 will not perform as intended 
through the remaining duration of the established link. Critical divergence occurs in real- 
world applications when: 

1. The VAD receives a very low-level signal at the onset of the channel link 
and for more than 320 ms; 

2. The VAD receives a signal that is not representative of the subsequent 
signals at the onset of the channel link and for more than 320 ms; and 

3. The characteristic features of the background noise change rapidly. 

In the first instance, the vector containing the running average of the background noise 
characteristics is initialized with all zeros. In the second instance, the vector contains 
values far removed from the real background noise characteristics. And in the third 
instance, the spectral distance differential, $\Delta SD$, will never be less than 0.0637. As the 
VAD 1 increasingly allocates resources to the conveyance of noise through the 
communication channel 4, it proportionately decreases the efficiency of the channel 4. 
An inefficient communication channel is an expensive one. The present invention 
overcomes these deficiencies. 



For completeness, the parameters used to characterize the background noise are 
described below. Let the set of autocorrelation coefficients extracted from a frame of 
information representing a 10 ms portion of an incoming signal be designated by: 

$$\{R(i)\}_{i=0}^{q}$$ 

A set of line spectral frequencies is derived from the autocorrelation coefficients, in 
accordance with Recommendation G.729, and is designated by: 

$$\{LSF_i\}_{i=1}^{p}$$ 

As stated previously, the full-band energy $E_f$ is obtained through the equation: 

$$E_f = 10 \log_{10}\left[\frac{1}{240} R(0)\right],$$ 

where $R(0)$ is the first autocorrelation coefficient. 

The low-band energy, measured over the frequency spectrum from zero to some upper 
frequency limit, $F_1$, is obtained through the equation: 

$$E_l = 10 \log_{10}\left[\frac{1}{240} h^{T} R h\right],$$ 

where $h$ is the impulse response of an FIR filter with a cutoff frequency at $F_1$ Hz and 
$R$ is the Toeplitz autocorrelation matrix with the autocorrelation coefficients on each 
diagonal. 

The normalized zero crossing rate is given by the equation: 

$$ZC = \frac{1}{2M} \sum_{i=0}^{M-1} \left[\,\left|\operatorname{sgn}(x(i)) - \operatorname{sgn}(x(i-1))\right|\,\right],$$ 

where $x(i)$ is the pre-processed input signal and $M$ is the number of samples in the frame. 
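
The full-band energy and the normalized zero crossing rate can be computed along the following lines. This is a sketch under stated assumptions: the zero crossing rate is normalized by twice the number of samples in the frame, `sgn(0)` is taken as +1, and the function names are hypothetical. The low-band energy is omitted because the filter response `h` is not specified here.

```python
import math
from typing import Sequence


def full_band_energy(r0: float, window_len: int = 240) -> float:
    """E_f = 10 * log10(R(0) / 240), with R(0) the first autocorrelation coefficient."""
    return 10.0 * math.log10(r0 / window_len)


def sgn(value: float) -> int:
    """Sign function; treating zero as positive is an assumption of this sketch."""
    return 1 if value >= 0 else -1


def zero_crossing_rate(frame: Sequence[float], previous_sample: float) -> float:
    """Sum |sgn(x(i)) - sgn(x(i-1))| over the frame and normalize by 2 * M."""
    total = 0
    last = previous_sample
    for sample in frame:
        total += abs(sgn(sample) - sgn(last))
        last = sample
    return total / (2 * len(frame))
```

Each sign change contributes 2 to the sum, so the normalized rate lies between 0 (no crossings) and 1 (a crossing at every sample).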

For the first thirty-two frames, the average spectral parameters of the background 
noise, denoted by $\{LSF_{avg}\}$, are initialized as an average of the line spectral frequencies of 
the frames, and the average of the background noise zero crossing rate, denoted by $ZC_{avg}$, 
is initialized as an average of the zero crossing rate, $ZC$, of the frames. The running 
averages of the full-band background noise energy, denoted by $E_{f,avg}$, and the background 
noise low-band energy, denoted by $E_{l,avg}$, are initialized as follows. First, the initialization 
procedure sets $E_{n,avg}$ to the average of the frame energy, $E_f$, over the first thirty- 
two frames. The three parameters, $\{LSF_{avg}\}$, $ZC_{avg}$, and $E_{n,avg}$, include only the frames that 
have an energy, $E_f$, greater than -70 dBm. Thereafter, the initialization procedure sets the 
parameters as follows: 

If $E_{n,avg} < T_1$, then 
    $E_{l,avg} = E_{n,avg} - 53{,}687{,}091$ 
else if $T_1 < E_{n,avg} < T_2$, then 
    $E_{f,avg} = E_{n,avg} - 67{,}108{,}864$ 
    $E_{l,avg} = E_{n,avg} - 93{,}952{,}410$ 
else 
    $E_{f,avg} = E_{n,avg} - 134{,}217{,}728$ 
    $E_{l,avg} = E_{n,avg} - 161{,}061{,}274$ 

A long-term minimum energy parameter, $E_{min}$, is calculated as the minimum value of $E_f$ 
over the previous 128 frames. 



Four differential values are generated from the differences between the current 
frame parameters and the running averages of the background noise parameters. The 
spectral distortion differential value is generated as the sum of squares of the difference 
between the current frame $\{LSF_i\}$ vector and the running averages of the background 
noise line spectral frequencies $\{LSF_{avg}\}$ and may be expressed by the equation: 

$$\Delta S = \sum_{i=1}^{p} \left(LSF_i - LSF_{avg,i}\right)^2$$ 

The full-band energy differential value may be expressed as: 

$\Delta E_f = E_{f,avg} - E_f$, where $E_f$ is the full-band energy of the current frame. 

The low-band energy differential value may be expressed as: 

$\Delta E_l = E_{l,avg} - E_l$, where $E_l$ is the low-band energy of the current frame. 

Lastly, the zero crossing rate differential value may be expressed as: 

$\Delta ZC = ZC_{avg} - ZC$, where $ZC$ is the zero crossing rate of the current frame. 
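
The four differential values may be computed as in the sketch below. The `Features` container and the `differentials` function are hypothetical names introduced for illustration; each energy and zero crossing differential is taken as the running average minus the current frame value, as in the expressions above.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Features:
    """Hypothetical bundle of the per-frame (or running average) parameters."""
    lsf: Sequence[float]  # line spectral frequencies
    e_f: float            # full-band energy
    e_l: float            # low-band energy
    zc: float             # zero crossing rate


def differentials(frame: Features, avg: Features) -> Tuple[float, float, float, float]:
    """Return (delta_S, delta_E_f, delta_E_l, delta_ZC) for the current frame."""
    delta_s = sum((cur - mean) ** 2 for cur, mean in zip(frame.lsf, avg.lsf))
    delta_e_f = avg.e_f - frame.e_f
    delta_e_l = avg.e_l - frame.e_l
    delta_zc = avg.zc - frame.zc
    return delta_s, delta_e_f, delta_e_l, delta_zc
```

With this sign convention, a frame that is louder than the running average yields negative energy differentials.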

SUMMARY OF THE INVENTION 

Since the problem occurs with communications conforming to ITU G.729 Annex 
B, the solution to the problem must improve upon the Recommendation without departing 
from its requirements. The key to achieving this is to make the condition for updating the 
background noise parameters independent of the value of the updated parameters. The 
solution includes: 

1. eliminating all of the frames having a very low level, such as below 
-70 dBm0, from: (a) updating the background noise characteristics established at 
the beginning of call setup for the link and (b) contributing toward the frame count 
used to determine the end of the initialization period; 

2. providing a supplemental background noise identification algorithm that 
averages the background noise characteristics for all frames satisfying the 
conditions of step (1), above; 

3. occasionally comparing the average background noise characteristics 
obtained using the methodology described in G.729 Annex B to those obtained 
using the supplemental algorithm; and 

4. substituting the background noise characteristics obtained using the 
supplemental algorithm for those obtained using the G.729 Annex B methodology 
whenever the two sets of characteristics have diverged substantially. 

The supplemental algorithm establishes two thresholds that are used to maintain a 
margin between the domains of the most likely noise and voice energies. One threshold 
identifies an upper boundary for noise energy and the other identifies a lower boundary 
for voice energy. If the block energy of the current frame is less than the noise energy 
threshold, then the parameters extracted from the signal of the current frame are used to 
characterize the expected background noise for the supplemental algorithm. If the block 
energy of the current frame is greater than the voice threshold, then the parameters 
extracted from the signal of the current frame are used to characterize the current voice 
energy for the supplemental algorithm. A block energy lying between the noise and voice 
thresholds will not be used to update the characterization of the background noise or the 
noise and voice energy thresholds for the supplemental algorithm. 

The supplemental algorithm updates the noise and voice energy thresholds 
whenever the block energy of the current frame falls outside the range of energies 
between the two threshold levels, and updates the running averages of the background 
noise when the block energy falls below the noise threshold. Because 
the noise and voice threshold levels are determined in a way that supports more frequent 
updates to the running averages of the background noise characteristics than is obtained 
through the G.729 Annex B algorithm, the running averages of the supplemental 
algorithm are more likely to reflect the expected value of the background noise 
characteristics for the next frame. By substituting the supplemental algorithm's 
characterization of the background noise for that of the G.729 Annex B algorithm, the 
estimations of noise and voice energy may be decoupled and made independent of the 
G.729 Annex B characterization when divergence occurs. Both the noise threshold and 
voice threshold are based on minimum and maximum block energy during one updating 
period and are updated every 1.28 seconds. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Preferred embodiments of the invention are discussed hereinafter in reference to 
the drawings, in which: 

Figure 1 - illustrates a half-duplex communication link conforming to 
Recommendation G.729 Annex B; 

Figure 2 - illustrates representative probability distribution functions for the 
background noise energy and the voice energy at the input of a G.729 Annex B 
communication channel; 

Figure 3 - illustrates the process flow for the integrated G.729 Annex B and 
supplemental VAD algorithms; 

Figure 4 - illustrates a continuation of the process flow of Figure 3; 

Figure 5 - illustrates a test signal representing a speaker's voice provided to a 
G.729 Annex B communication link and the G.729 Annex B VAD response to this input 
signal; 

Figure 6 - illustrates the test signal of Figure 5 with a low-level signal preceding it, 
the G.729 Annex B VAD response to the combined test signal, and the supplemental 
VAD response to the combined test signal; 

Figure 7 - illustrates a conversational test signal provided to a G.729 Annex B 
communication link, the response to the test signal by a standard G.729 Annex B VAD, 
and the supplemental VAD's response to the test signal; and 

Figure 8 - illustrates a second conversational test signal provided to a G.729 Annex 
B communication link, the response to the test signal by a standard G.729 Annex B VAD, 
and the supplemental VAD's response to the test signal. 

DETAILED DESCRIPTION OF THE INVENTION 

Figure 2 illustrates representative probability distribution functions for the 
background noise energy 8 and the voice energy 9 at the input of a G.729 Annex B 
communication channel. In this figure, the horizontal axis 12 shows the domain of 
energy levels and the vertical axis 13 shows the probability density range for the plotted 
functions 8, 9. A dynamic noise threshold 10 is mathematically determined and used to 
mark the upper boundary of the energy domain that is likely to contain background noise 
alone. Similarly, a dynamic voice threshold 11 is mathematically determined and used to 
mark the lower boundary of the energy domain that is likely to contain voice energy. The 
dynamic thresholds 10, 11 vary in accordance with the noise and voice energy probability 
distribution functions 8, 9, for the time period, T, in which the probability distribution 
functions are established. 

A supplemental algorithm is used to determine the noise and voice thresholds 10, 
11 for each period, T, of the established probability distribution functions. This period is 
preferably 1.28 seconds in length and, therefore, the noise and voice thresholds are 
updated every 1.28 seconds. The supplemental algorithm updates the noise and 
voice thresholds 10, 11 in the following way. 
Let, 

$E_{max}$ = the maximum block energy measured during the current updating period, $T_p$; 
$E_{min}$ = the minimum block energy measured during the current updating period, $T_p$; 
$T_1 = E_{min} + (E_{max} - E_{min})/32$; and 
$T_2 = 4 \times E_{min}$. 

The noise energy threshold, $T_{noise}$, and voice energy threshold, $T_{voice}$, are calculated from 
the following equations: 

$T_{noise} = \min(2 \times \min(T_1, T_2),\ -21\ \text{dBm})$; and 

$T_{voice} = \min(\max(a \times \max(T_1, T_2),\ -65\ \text{dBm}),\ -17\ \text{dBm})$; 

where, 

$a = 16$, when $E_{max}/E_{min} > 2^{13}$; and 

$a = 4$, when $E_{max}/E_{min} < 2^{13}$. 
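
The threshold equations above may be exercised with the following sketch. Several points are assumptions made only for illustration: block energies are treated as linear power values, the dBm limits are converted to the same linear scale with a hypothetical `dbm_to_linear` helper, the multiplier is taken as 4 when the ratio equals exactly 2^13 (a case the equations leave undefined), and the function name `update_thresholds` is not from the original text.

```python
from typing import Tuple


def dbm_to_linear(dbm: float) -> float:
    """Convert a dBm level to a linear power value (assumed common reference)."""
    return 10.0 ** (dbm / 10.0)


def update_thresholds(e_max: float, e_min: float) -> Tuple[float, float]:
    """Return (T_noise, T_voice) from one period's block energy extremes."""
    if e_min <= 0.0:
        raise ValueError("e_min must be positive to form the energy ratio")
    t1 = e_min + (e_max - e_min) / 32.0
    t2 = 4.0 * e_min
    # Multiplier a: 16 for a wide dynamic range, otherwise 4 (assumed at equality).
    a = 16.0 if e_max / e_min > 2 ** 13 else 4.0
    t_noise = min(2.0 * min(t1, t2), dbm_to_linear(-21.0))
    t_voice = min(max(a * max(t1, t2), dbm_to_linear(-65.0)), dbm_to_linear(-17.0))
    return t_noise, t_voice
```

The clamps keep the noise threshold from rising above -21 dBm and confine the voice threshold to the range from -65 dBm to -17 dBm, regardless of the measured extremes.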
Explained textually, $T_{noise}$ is calculated for the current updating period, $T_p$, by first 
determining the lesser of the two values $T_1$ and $T_2$. The lesser value of $T_1$ and $T_2$ is 
multiplied by two and the product is compared to a value of -21 dBm. Finally, the lesser 
of -21 dBm and this product is assigned to the parameter identifying the noise threshold 
for the current updating period, $T_p$. 

Similarly, $T_{voice}$ is calculated for the current updating period, $T_p$, by first 
determining the greater of the two values $T_1$ and $T_2$. The greater value 
of $T_1$ and $T_2$ is multiplied by the value of $a$ and the product is compared to a value of -65 
dBm. Next, the greater of -65 dBm and this product is compared to a value of -17 dBm, 
and the lesser of the two values is assigned to the parameter identifying the voice 
threshold for the current updating period, $T_p$. 

As an aside, the noise and voice probability distribution functions for each 
updating period, T, may be determined from the sets $\{E_{voice}(1), E_{voice}(2), E_{voice}(3), \ldots, 
E_{voice}(j)\}$ and $\{E_{noise}(1), E_{noise}(2), E_{noise}(3), \ldots, E_{noise}(j)\}$, where $j$ is the highest-valued 
block index within the updating period. These set values are calculated using the 
following equations: 

$E_{voice}(n) = (1 - \alpha_{voice}) \times E_{voice}(n-1) + \alpha_{voice} \times E(n)$; and 

$E_{noise}(n) = (1 - \alpha_{noise}) \times E_{noise}(n-1) + \alpha_{noise} \times E(n)$; 

where, 

$E(n)$ = the $n$th 5 ms block energy measurement within the current updating period; 

$\alpha_{voice} = 64^{-1}$, when $E(n) > T_{voice}$; 

$\alpha_{voice} = 0$, when $E(n) \le T_{voice}$; 

$\alpha_{noise} = 32^{-1}$, when $E(n) < T_{noise}$; and 

$\alpha_{noise} = 0$, when $E(n) \ge T_{noise}$. 
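
One step of the smoothing recursions can be sketched as follows. The sketch assumes that the voice estimate uses a coefficient of 1/64 and is driven only by blocks above the voice threshold, and that the noise estimate uses a coefficient of 1/32 and is driven only by blocks below the noise threshold; these readings, and the name `smooth_step`, are assumptions of this illustration.

```python
from typing import Tuple


def smooth_step(
    e_voice_prev: float,
    e_noise_prev: float,
    block_energy: float,
    t_noise: float,
    t_voice: float,
) -> Tuple[float, float]:
    """Advance E_voice(n) and E_noise(n) by one block energy measurement E(n)."""
    alpha_voice = 1.0 / 64.0 if block_energy > t_voice else 0.0
    alpha_noise = 1.0 / 32.0 if block_energy < t_noise else 0.0
    e_voice = (1.0 - alpha_voice) * e_voice_prev + alpha_voice * block_energy
    e_noise = (1.0 - alpha_noise) * e_noise_prev + alpha_noise * block_energy
    return e_voice, e_noise
```

A coefficient of zero leaves the corresponding estimate unchanged, so a block lying between the two thresholds affects neither estimate.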

In addition to updating the noise and voice energy thresholds for each updating 
period, T, the supplemental algorithm compares the two thresholds to the block energy of 
each incoming frame of the digitized signal to decide when to update the running 
averages of the supplemental background noise characteristics. Whenever the block 
energy of the current frame falls below the noise threshold, the running averages of the 
supplemental background noise characteristics are updated. Whenever the block energy 
of the current frame exceeds the voice threshold, the voice energy characteristics are 
updated. A frame having a block energy equal to a threshold or between the two 
thresholds is not used to update either the running averages of the supplemental 
background noise characteristics or the voice energy characteristics. 
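
The per-frame decision described above reduces to a three-way comparison, sketched here with hypothetical names (`Update`, `select_update`). Energies and thresholds are assumed to share the same units.

```python
from enum import Enum


class Update(Enum):
    """Which supplemental estimate, if any, the current frame should update."""
    NOISE = "noise"
    VOICE = "voice"
    NONE = "none"


def select_update(block_energy: float, t_noise: float, t_voice: float) -> Update:
    """Map a frame's block energy to the estimate it is allowed to update."""
    if block_energy < t_noise:
        return Update.NOISE
    if block_energy > t_voice:
        return Update.VOICE
    # Equal to a threshold, or between the two thresholds: no update.
    return Update.NONE
```

Unlike the primary update conditions, this decision does not consult the running averages of the background noise characteristics.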

The supplemental VAD algorithm operates in conjunction with a G.729 Annex B 
VAD algorithm, which is the primary algorithm. As described in the Background of the 
Invention section, the primary VAD algorithm compares the characteristics of the 
incoming frame to an adaptive threshold. An update to the primary background noise 
characteristics takes place only if the following three conditions are met: 

1) $E_f < E_{f,avg} + 3$ dB; 

2) $RC(1) < 0.75$; and 

3) $\Delta SD < 0.0637$. 

In a realistic scenario, the running averages of the background noise characteristics for 
the supplemental algorithm will be updated more frequently than those of the primary 
algorithm. Therefore, the running averages for the background noise characteristics of 
the supplemental algorithm are more likely to reflect the actual characteristics for the next 
incoming frame of background noise. 

A count of the number of consecutive incoming frames that fail to cause an update 
to the running averages of the primary background noise characteristics is kept by the 
supplemental algorithm. When the count reaches a critical value, it may be reasonably 
assumed that the running averages of the primary background noise characteristics have 
substantially diverged from the actual current values and that a re-convergence using the 
G.729 Annex B algorithm, alone, will not be possible. However, convergence may be 
established by substituting the running averages of the supplemental background noise 
characteristics for those of the primary background noise characteristics. 
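
The counting and substitution logic may be sketched as below. The critical count is left as a constructor parameter because no value is given in the text above, and resetting the count after a substitution is an assumption of this sketch. `DivergenceMonitor` is a hypothetical name.

```python
class DivergenceMonitor:
    """Counts consecutive frames that fail to update the primary running averages."""

    def __init__(self, critical_count: int) -> None:
        if critical_count < 1:
            raise ValueError("critical_count must be at least 1")
        self.critical_count = critical_count
        self.count = 0

    def observe(self, primary_updated: bool) -> bool:
        """Return True when the supplemental averages should replace the primary ones."""
        if primary_updated:
            self.count = 0
            return False
        self.count += 1
        if self.count >= self.critical_count:
            self.count = 0  # assumed: start counting afresh after a substitution
            return True
        return False
```

When `observe` returns True, the caller would copy the supplemental running averages over the primary running averages before processing the next frame.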

Therefore, the supplemental algorithm provides information complementary to that 
of the primary algorithm. This information is used to maintain convergence between the 
expected values of the background noise characteristics and their actual current values. 
Additionally, the supplemental algorithm prevents extremely low amplitude signals from 
biasing the running averages of the background noise characteristics during the 
initialization period. By eliminating the atypical bias, the supplemental algorithm better 
converges the initial running averages of the primary background noise characteristics 
toward realistic values. 

The complementary aspects of the G.729 Annex B and the supplemental VAD 
algorithms are discussed in greater detail in the following paragraphs and with reference 
to Figures 3 and 4. Although the two VAD algorithms are preferably separate entities 
that execute in parallel, they are illustrated in Figures 3 and 4 as an integrated process 14 
for ease of illustration and discussion. 

When a communication link is established, the integrated process 14 is started 15. 
Acoustical analog signals received by the microphone of the transmitting side of the link 
are converted to electrical analog signals by a transducer. These electrical analog signals 
are sampled by an analog-to-digital (A/D) converter and the sampled signals are 
represented by a number of digital bits. The digitized representations of the sampled 
signals are formed into frames of digital bits. Each frame contains a digital representation 
of a consecutive 10 ms portion of the original acoustical signal. Since the microphone 
continually receives either the speaker's voice or background noise, the 10 ms frames are 
continually received in a serial form by the G.729 Annex B VAD and the supplemental 
VAD. 

A set of parameters characterizing the original acoustical signal is extracted from 
the information contained within each frame, as indicated by reference numeral 16. 
These parameters are the autocorrelation coefficients, which are derived in accordance 
with Recommendation G.729, and are denoted by: 

$$\{R(i)\}_{i=0}^{q}, \text{ where } q = 12$$ 

The update to the minimum buffer 17, as described in G.729, is performed after the 
extraction of the characterization parameters. 

A comparison of the frame count with a value of thirty-two is performed, as 
indicated by reference numeral 18, to determine whether an initialization of the running 
averages of the noise characteristics has taken place. If the number of frames received by 
the G.729 Annex B VAD having a full-band energy equal to or greater than -70 dBm, 
since the last initialization of the frame count, is less than thirty-two, then the integrated 
process 14 executes the noise characteristic initialization process, indicated by reference 
numerals 23-25 and 27. 

Occasionally, a communication link may have a period of extremely low-level 
background noise. To prevent this atypical period of background noise from negatively 
biasing the initial averaging of the noise characteristics, the integrated process 14 filters 
the incoming frames. A comparison of the current frame's full-band energy to a reference 
level of -70 dBm is made, as indicated by reference numeral 23. If the current frame's 
energy equals or exceeds the reference level, then an update is made to the initial average 
frame energy, $E_{n,avg}$, the average zero-crossing rate, $ZC_{avg}$, and the average line spectral 
frequencies, $LSF_{avg}$, as indicated by reference numeral 24 and described in 
Recommendation G.729 Annex B. Thereafter, the G.729 Annex B VAD sets an output to 
one to indicate the detected presence of voice activity in the current frame, as indicated by 
reference numeral 25, and increments the frame count by a value of one 26. If the 
current frame's energy is less than the reference level, the G.729 Annex B VAD sets its 
output to zero to indicate the non-detection of voice activity in the current frame, as 
indicated by reference numeral 27. After the G.729 Annex B VAD makes the decision 
regarding the presence of voice activity 25, 27, the integrated process 14 continues with 
the extraction of the maximum and minimum frame energy values 33. 
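
The filtering step of the initialization process (reference numerals 23 through 27) can be sketched as follows. For brevity the sketch accumulates only the frame energy and zero crossing rate and omits the line spectral frequencies; `InitState` and `init_step` are hypothetical names, and energies are assumed to be in dBm.

```python
from dataclasses import dataclass


@dataclass
class InitState:
    """Hypothetical accumulators used while the frame count is below thirty-two."""
    frame_count: int = 0
    energy_sum: float = 0.0
    zc_sum: float = 0.0


def init_step(state: InitState, e_f_dbm: float, zc: float,
              reference_dbm: float = -70.0) -> int:
    """Process one frame during initialization and return the VAD output (1 or 0)."""
    if e_f_dbm >= reference_dbm:
        # Frame is loud enough: fold it into the initial averages and count it.
        state.energy_sum += e_f_dbm
        state.zc_sum += zc
        state.frame_count += 1
        return 1
    # Very low-level frame: excluded from the averages and from the count.
    return 0
```

Because low-level frames do not advance `frame_count`, a quiet start to the link extends the initialization period rather than biasing the initial averages.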

For each received frame having a full-band energy equal to or greater than -70 
dBm, the frame count is incremented by a value of one. When the frame count equals 
thirty-two, as determined by the comparison indicated by reference numeral 19, the 
integrated process 14 initializes running averages of the low-band noise energy, $E_{l,avg}$, and 
the full-band energy, $E_{f,avg}$, as indicated by reference numeral 20 and described in 
Recommendation G.729 Annex B. 

Next, the differential values between the background noise characteristics of the 
current frame and running averages of these noise characteristics are generated, as 
indicated by reference numeral 21. This process step is performed after the initialization 
of the running averages for the low- and full-band energies, when the frame count is 
thirty-two, but is performed directly after the frame count comparison, indicated by 
reference numeral 19, when the frame count exceeds thirty-two. Recommendation G.729 
Annex B describes the method for generating the difference parameters used by both the 
G.729 Annex B VAD and the supplemental VAD. After the difference parameters are 
generated, a comparison of the current frame's full-band energy is made with the 
reference value of -70 dBm, as indicated by reference numeral 22. 

Referring now to Figure 3, a multi-boundary initial G.729 Annex B VAD decision 
is made 28 if the current frame's full-band energy equals or exceeds the reference value. 
If the reference value exceeds the current frame's full-band energy, then the initial G.729 
Annex B VAD decision generates a zero output 29 to indicate the lack of detected voice 
activity in the current frame. Regardless of the initial value assigned, the G.729 Annex B 
VAD refines the initial decision to reflect the long-term stationary nature of the voice 
signal, as indicated by reference numeral 30 and described in Recommendation G.729 
Annex B. 

After the initial VAD decision has been smoothed, with respect to preceding VAD 
decisions, so as to form a final VAD decision, the integrated process makes a 
determination of whether the background noise energy thresholds have been met by the 
noise characteristics of the current frame, as indicated by reference numeral 31. The 
characteristics of the incoming frame are compared to an adaptive threshold, by the G.729 
Annex B VAD, and an update to the running averages of the G.729 Annex B noise 
characteristics 32 takes place only if the following three conditions are met: 

1) E_f < E_{f,avg} + 3 dB; 

2) RC(1) < 0.75; and 

3) ΔSD < 0.0637; 

where, 

E_f = the full-band noise energy of the current frame; 
E_{f,avg} = the average full-band noise energy; 
RC(1) = the first reflection coefficient; and 

ΔSD = the difference between the measured spectral distance for the current frame 
and the running average value of the spectral distance, with a ΔSD of 0.0637 
corresponding to 254.6 Hz. The average full-band noise energy E_{f,avg} is further 
updated, as is counter C_n, according to the following conditions. Set: 

E_{f,avg} = E_min; and 

C_n = 0, 

when, 

C_n > 128; and 

E_{f,avg} < E_min. 
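
The update gate and the subsequent reset can be sketched as follows. This is a minimal illustration, not the Recommendation's reference code: the function and constant names are hypothetical, energies are assumed to be expressed in dB, and the second reset condition is assumed to compare the running average against the minimum energy E_min.

```python
# Thresholds quoted in the description above.
ENERGY_MARGIN_DB = 3.0     # margin added to the average full-band energy
RC1_LIMIT = 0.75           # limit on the first reflection coefficient
DELTA_SD_LIMIT = 0.0637    # limit on the spectral distance difference
COUNTER_LIMIT = 128        # counter value above which the reset may apply


def noise_update_allowed(e_f, e_f_avg, rc1, delta_sd):
    """Return True only when all three update conditions hold."""
    return (
        e_f < e_f_avg + ENERGY_MARGIN_DB
        and rc1 < RC1_LIMIT
        and delta_sd < DELTA_SD_LIMIT
    )


def apply_minimum_reset(e_f_avg, c_n, e_min):
    """Set the average to e_min and clear the counter when both reset
    conditions hold; otherwise return the inputs unchanged.

    Assumption: the second condition is e_f_avg < e_min.
    """
    if c_n > COUNTER_LIMIT and e_f_avg < e_min:
        return e_min, 0
    return e_f_avg, c_n
```

A frame at -60 dB against an average of -62 dB passes the energy test because it lies inside the 3 dB margin, while a frame at -58 dB does not.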

Textually stated, the running averages of the G.729 Annex B background noise 
characteristics are updated 32 to reflect the contribution of the current frame using a first 
order Auto-Regressive scheme when a frame containing only noise activity is detected. 

Integrated process 14 measures the full-band energy of each incoming frame. For 
every period, i, of 1.28 seconds, the maximum and minimum full-band energies are 
identified 33 and used to generate the noise threshold 34 for the next period, i+1. This 
process of identifying maximum and minimum full-band energies, E_max and E_min, during 
period i to generate the noise threshold, T_{noise,i+1}, for the next time period is performed 
when any of the following conditions are met: 

1. a G.729 Annex B VAD output decision is made while the frame count is 
less than thirty-two; 

2. the G.729 Annex B background noise energy thresholds are not met, as 
determined in the step identified by reference numeral 31; or 

3. an update to the running averages of the G.729 Annex B background noise 
characteristics is made, as identified by reference numeral 32. 

The value of T_{noise,i} for the first time period, i, is initialized to -55 dBm. For all 
subsequent periods, i, the supplemental algorithm generates the noise threshold 10 in the 
following way: 

T_noise = min(2 * min(T_1, T_2), -21 dBm), 

where, 

T_1 = E_min + (E_max - E_min)/32, 

E_max = the maximum block energy measured during the current updating period, T_p; 
and 

E_min = the minimum block energy measured during the current updating period, T_p. 
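
The threshold computation can be sketched as below, applying the formula exactly as printed. The function names are hypothetical. The second term of the inner minimum is not defined in this passage, so the sketch takes it as a caller-supplied argument rather than guessing its form, and it makes no assumption about whether the factor of two is meant on a logarithmic or a linear scale.

```python
NOISE_THRESHOLD_INIT_DBM = -55.0  # threshold used for the first period
NOISE_THRESHOLD_CAP_DBM = -21.0   # upper bound applied by the outer min()


def first_term(e_min, e_max):
    """First term: the period minimum plus 1/32 of the min-to-max range."""
    return e_min + (e_max - e_min) / 32.0


def noise_threshold(e_min, e_max, second_term):
    """Noise threshold for the next period, following the printed formula.

    `second_term` stands in for the second argument of the inner minimum,
    which this passage does not define (assumption: supplied by the caller).
    """
    inner = min(first_term(e_min, e_max), second_term)
    return min(2.0 * inner, NOISE_THRESHOLD_CAP_DBM)
```

For a period with a minimum of -64 and a maximum of -32, the first term evaluates to -63, that is, one thirty-second of the 32-unit range above the minimum.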

Next, the full-band energy of the current frame is compared to the -70 dBm 
reference and to the noise threshold 10, T_noise, generated by the supplemental VAD 
algorithm, as indicated by reference numeral 35. If the full-band energy of the current 
frame equals or exceeds the reference level and equals or falls below the noise threshold 
10, T_noise, then the running averages of the background noise characteristics, generated by 
the supplemental VAD algorithm, are updated using the autoregressive algorithm 
described for the G.729 Annex B VAD. This update is indicated in the integrated process 
flowchart 14 by reference numeral 36. 
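
The gated update of the supplemental running average can be sketched as follows. The names are hypothetical, and the smoothing coefficient `beta` is an assumption chosen only for illustration; the passage does not state the coefficient that the supplemental algorithm uses.

```python
REFERENCE_DBM = -70.0  # reference level quoted in the description


def ar_update(average, current, beta):
    """First-order autoregressive update of a running average."""
    return beta * average + (1.0 - beta) * current


def supplemental_update(average, e_f, t_noise, beta=0.75):
    """Update the running average only when the frame energy lies between
    the reference level and the noise threshold, inclusive.

    Assumption: beta = 0.75 is an illustrative value, not taken from the text.
    """
    if REFERENCE_DBM <= e_f <= t_noise:
        return ar_update(average, e_f, beta)
    return average
```

With an average of -60, a frame energy of -64, and a threshold of -55, the frame falls inside the window and the average moves a quarter of the way toward the frame energy, to -61. A frame at -75 or at -50 leaves the average unchanged.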

Thereafter, or if a negative determination was made for the current frame in the 
comparison identified by reference numeral 35, a decision is made whether to update the 
noise threshold 10, as indicated by reference numeral 37. If about 1.28 seconds have 
passed since the last update to the noise threshold 10, then the noise threshold is updated 
based upon the maximum and minimum full-band energy levels measured during the 
previous time period, as indicated by reference numeral 38. 
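
The per-period bookkeeping of maximum and minimum full-band energies can be sketched as below. The class name is hypothetical. The sketch assumes the 10 ms frame length of G.729, so that a period of 1.28 seconds spans 128 frames, and it simplifies the flowchart by observing every frame.

```python
FRAMES_PER_PERIOD = 128  # 1.28 s at an assumed 10 ms per frame


class PeriodEnergyTracker:
    """Track the minimum and maximum frame energy over one update period."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.count = 0
        self.e_min = float("inf")
        self.e_max = float("-inf")

    def observe(self, e_f):
        """Record one frame energy.

        Returns (e_min, e_max) when the period completes, otherwise None.
        The tracker restarts automatically after a completed period.
        """
        self.e_min = min(self.e_min, e_f)
        self.e_max = max(self.e_max, e_f)
        self.count += 1
        if self.count < FRAMES_PER_PERIOD:
            return None
        result = (self.e_min, self.e_max)
        self.reset()
        return result
```

The pair returned at the end of a period is what the threshold update for the following period would consume.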

Next, a decision is made whether to compare the running averages of the 
background noise characteristics maintained by the separate G.729 Annex B and the 
supplemental VAD algorithms, as indicated by reference numeral 39. A decision to 
compare the noise characteristics of the separate VAD algorithms may be based upon an 
elapsed time period, a particular number of elapsed frames, or some similar measure. In a 
preferred embodiment, a counter is used to count the number of consecutive frames that 
have been received by the integrated process 14 without the G.729 Annex B update 
condition, identified by reference numeral 31, having been met. When the counter 
reaches the particular number of consecutive frames that optimally identifies the critical 
point of likely divergence between the running averages of the background noise 
characteristics generated using the separate G.729 Annex B and supplemental VAD 
algorithms, a comparison between these two sets of characteristics is made. This 
comparison between the two sets of noise characteristics is made in the process step 
identified by reference numeral 40. 

If the running averages of the background noise characteristics calculated using the 
G.729 Annex B and supplemental VAD algorithms have diverged, then the values for 
these characteristics generated by the supplemental VAD algorithm are substituted for the 
respective values of these characteristics generated by the G.729 Annex B algorithm. The 
substitution occurs in the step identified by reference numeral 41. 
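
The comparison and substitution steps can be sketched as follows. This is a simplified illustration under several assumptions: the noise characteristics are reduced to the full-band and low-band energy averages, divergence is judged by a hypothetical tolerance in dB, and the frame limit that triggers the comparison is a caller-supplied value. None of these specifics is stated in the passage.

```python
from dataclasses import dataclass, replace


@dataclass
class NoiseStats:
    """Running averages kept by one VAD algorithm (simplified to two fields)."""
    full_band_db: float
    low_band_db: float


def has_diverged(main, supplemental, tolerance_db):
    """Hypothetical divergence test: any field differs by more than tolerance."""
    return (
        abs(main.full_band_db - supplemental.full_band_db) > tolerance_db
        or abs(main.low_band_db - supplemental.low_band_db) > tolerance_db
    )


def reconcile(main, supplemental, frames_without_update, frame_limit, tolerance_db):
    """Return the statistics the G.729 Annex B VAD should use next.

    The comparison runs only after `frame_limit` consecutive frames have
    passed without an Annex B update; on divergence, a copy of the
    supplemental values is substituted.
    """
    if frames_without_update < frame_limit:
        return main
    if has_diverged(main, supplemental, tolerance_db):
        return replace(supplemental)
    return main
```

Returning a copy rather than the supplemental object itself keeps the two sets of running averages independent after the substitution.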

Thereafter, a determination of whether the link has terminated and there are no 
more frames to act on is made, as indicated by reference numeral 42, if any of the 
following conditions are met: 

1. a negative determination is made in the step identified by reference numeral 
39 regarding whether the optimal time has arrived to compare the running averages 
of the background noise characteristics generated by the G.729 Annex B and the 
supplemental VAD algorithms; 

2. a negative determination is made in the step identified by reference numeral 
40 regarding whether the running averages of the background noise characteristics 
generated by the G.729 Annex B and the supplemental VAD algorithms have 
diverged; or 

3. the running averages of the background noise characteristics from the 
supplemental algorithm have been substituted for the respective values of these 
characteristics from the G.729 Annex B algorithm, in the step identified by 
reference numeral 41. 

If the last frame of the link has been received by the G.729 Annex B VAD, then the 
integrated process 14 is terminated, as indicated by reference numeral 43. Otherwise, the 
integrated process 14 extracts the characterization parameters from the next sequentially 
received frame, as indicated by reference numeral 16. 

Referring now to Figure 5, a test signal 44 representing a speaker's voice is 
provided to a G.729 Annex B communication link. The G.729 Annex B VAD produces 
the output signal 45 in response to the incoming test signal 44. The horizontal axis of 
graph 46 has units of time and the horizontal axis of graph 47 has units of elapsed frames. 
The vertical axes of both graphs have units of amplitude. An amplitude value of one for 
the VAD output signal 45 indicates the detected presence of voice activity within the 
frame identified by the corresponding value along the horizontal axis. An amplitude 
value of zero in the VAD output signal 45 indicates the lack of voice activity detected 
within the frame identified by the corresponding value along the horizontal axis. 

Figure 6 illustrates the test signal 44 of graph 46 with a low-level signal 54 
preceding it. Low-level signal 54 is generated by the analog representation of six 
hundred and forty consecutive zeros from a G.729 Annex B digitally encoded signal. 
Together, the test signal 44 and the analog representation of the six hundred and forty 
zeros form the test signal 48 in graph 51. Graph 52 illustrates the G.729 Annex B VAD 
response 49 to the test signal 48. Similarly, graph 53 illustrates the supplemental VAD 
algorithm response 50 to test signal 48. Notice in graph 52 that the G.729 Annex B VAD 
identifies all incoming frames as voice frames, after some number of initialization frames 
have elapsed. Because the G.729 Annex B VAD has received a very low-level signal 54 
at the onset of the channel link for more than 320 ms, the VAD's characterization of the 
background noise has critically diverged from the expected characterization. As a result, 
the G.729 Annex B VAD will not perform as intended through the remaining duration of 
the established link. The supplemental VAD algorithm ignores the effect of the low-level 
signal 54 preceding the test signal 44 in combined signal 48. Therefore, the atypical noise 
signal does not bias the supplemental VAD's characterization of the background noise 
away from its expected characterization. It is instructive to note that the supplemental 
VAD's response to signal 44 in graph 53 is identical, or nearly so, to the G.729 Annex B 
VAD's response to signal 44 in graph 47. 

Figure 7 illustrates a conversational test signal 55, in graph 58, provided to a 
G.729 Annex B communication link. Graph 59 illustrates the response 56 to test signal 
55 by a standard G.729 Annex B VAD and graph 60 illustrates the supplemental VAD's 
response 57 to test signal 55. A comparison of the supplemental VAD response to the 
standard G.729 Annex B response shows that the former provides better performance in 
terms of bandwidth savings and reproduced speech quality. 

Figure 8 illustrates another conversational test signal 61 provided to a G.729 
Annex B communication link. Graph 64 illustrates the response 62 to test signal 61 by a 
standard G.729 Annex B VAD and graph 65 illustrates the supplemental VAD's response 
63 to test signal 61. A comparison of the supplemental VAD response to the standard 
G.729 Annex B response shows that the former has five percent more noise frames 
identified than the latter. Therefore, the supplemental VAD algorithm is shown to better 
converge with the expected characteristics of the current frame. 

Because many varying and different embodiments may be made within the scope 
of the inventive concept herein taught, and because many modifications may be made in 
the embodiments herein detailed in accordance with the descriptive requirements of the 
law, it is to be understood that the details herein are to be interpreted as illustrative and 
not in a limiting sense. 



Appl.No. T32794 






